GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation


[1] Y. Liang, F. Meng, Y. Zhang, Y. Chen, J. Xu, and J. Zhou, “Emotional conversation generation with heterogeneous graph neural network,” Artificial Intelligence, vol. 308, p. 103714, 2022.

[2] T. Fu, S. Gao, X. Zhao, J.-R. Wen, and R. Yan, “Learning towards conversational AI: A survey,” AI Open, pp. 14–28, 2022.

[3] L. Nie, W. Wang, R. Hong, M. Wang, and Q. Tian, “Multimodal dialog system: Generating responses via adaptive decoders,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1098–1106.

[4] C. Gao, W. Lei, X. He, M. de Rijke, and T.-S. Chua, “Advances and challenges in conversational recommender systems: A survey,” AI Open, vol. 2, pp. 100–126, 2021.

[5] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency, “Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages,” IEEE Intelligent Systems, vol. 31, no. 6, pp. 82–88, 2016.

[6] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, “Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2236–2246.

[7] Y. Yang, D.-C. Zhan, X.-R. Sheng, and Y. Jiang, “Semi-supervised multi-modal learning with incomplete modalities.” in IJCAI, 2018, pp. 2998–3004.

[8] Z. Xue, J. Du, D. Du, W. Ren, and S. Lyu, “Deep correlated predictive subspace learning for incomplete multi-view semi-supervised classification.” in IJCAI, 2019, pp. 4026–4032.

[9] M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, and X. Peng, “SMIL: Multimodal learning with severely missing modality,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2302–2310.

[10] H. Pham, P. P. Liang, T. Manzini, L.-P. Morency, and B. Póczos, “Found in translation: Learning robust joint representations by cyclic translations between modalities,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6892–6899.

[11] L. Zhang, Y. Zhao, Z. Zhu, D. Shen, and S. Ji, “Multi-view missing data completion,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 7, pp. 1296–1309, 2018.

[12] Q. Suo, W. Zhong, F. Ma, Y. Yuan, J. Gao, and A. Zhang, “Metric learning on healthcare data with incomplete modalities.” in IJCAI, 2019, pp. 3534–3540.

[13] Y. Liu, L. Fan, C. Zhang, T. Zhou, Z. Xiao, L. Geng, and D. Shen, “Incomplete multi-modal representation learning for Alzheimer’s disease diagnosis,” Medical Image Analysis, vol. 69, pp. 1–11, 2021.

[14] Z. Lian, J. Tao, B. Liu, and J. Huang, “Conversational emotion analysis via attention mechanisms,” in Proceedings of the Interspeech, 2019, pp. 1936–1940.

[15] Z. Lian, B. Liu, and J. Tao, “DECN: Dialogical emotion correction network for conversational emotion recognition,” Neurocomputing, vol. 454, pp. 483–495, 2021.

[16] C. Zhang, Y. Cui, Z. Han, J. T. Zhou, H. Fu, and Q. Hu, “Deep partial multi-view learning,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 44, no. 05, pp. 2402–2415, 2022.

[17] J. Zhao, R. Li, and Q. Jin, “Missing modality imagination network for emotion recognition with uncertain missing modalities,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 2608–2618.

[18] J. Chen and A. Zhang, “HGMF: Heterogeneous graph-based fusion for multimodal data with incompleteness,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1295–1305.

[19] S. Parthasarathy and S. Sundaram, “Training strategies to handle missing modalities for audio-visual expression recognition,” in Companion Publication of the 2020 International Conference on Multimodal Interaction, 2020, pp. 400–404.

[20] F. Ma, X. Xu, S.-L. Huang, and L. Zhang, “Maximum likelihood estimation for multimodal learning with missing modality,” arXiv preprint arXiv:2108.10513, 2021.

[21] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. Lee Giles, “Learning to extract semantic structure from documents using multimodal fully convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5315–5324.

[22] P. P. Liang, Z. Liu, Y.-H. H. Tsai, Q. Zhao, R. Salakhutdinov, and L.-P. Morency, “Learning representations from imperfect time series data via tensor rank regularization,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL, 2019, pp. 1569–1576.

[23] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.

[24] R. Mazumder, T. Hastie, and R. Tibshirani, “Spectral regularization algorithms for learning large incomplete matrices,” The Journal of Machine Learning Research, vol. 11, pp. 2287–2322, 2010.

[25] H. Fan, Y. Chen, Y. Guo, H. Zhang, and G. Kuang, “Hyperspectral image restoration using low-rank tensor recovery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 10, pp. 4589–4604, 2017.

[26] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.

[27] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.

[28] L. Tran, X. Liu, J. Zhou, and R. Jin, “Missing modalities imputation via cascaded residual autoencoder,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1405–1414.

[29] Q. Wang, Z. Ding, Z. Tao, Q. Gao, and Y. Fu, “Partial multi-view clustering via consistent GAN,” in IEEE International Conference on Data Mining (ICDM), 2018, pp. 1290–1295.

[30] L. Cai, Z. Wang, H. Gao, D. Shen, and S. Ji, “Deep adversarial learning for multi-modality missing data completion,” in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 1158–1166.

[31] O. Ivanov, M. Figurnov, and D. Vetrov, “Variational autoencoder with arbitrary conditioning,” in Proceedings of the 7th International Conference on Learning Representations, 2019, pp. 1–25.

[32] C. Du, C. Du, H. Wang, J. Li, W.-L. Zheng, B.-L. Lu, and H. He, “Semi-supervised deep generative modelling of incomplete multi-modality emotional data,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 108–116.

[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[34] Z. Yuan, W. Li, H. Xu, and W. Yu, “Transformer-based feature reconstruction network for robust multimodal sentiment analysis,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4400–4407.

[35] Y. Duan, Y. Lv, W. Kang, and Y. Zhao, “A deep learning based approach for traffic data imputation,” in 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2014, pp. 912–917.

[36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[37] L. Yuan, Y. Wang, P. M. Thompson, V. A. Narayan, J. Ye, A. D. N. Initiative et al., “Multi-source feature learning for joint analysis of incomplete multiple heterogeneous neuroimaging data,” NeuroImage, vol. 61, no. 3, pp. 622–632, 2012.

[38] S. Xiang, L. Yuan, W. Fan, Y. Wang, P. M. Thompson, and J. Ye, “Multi-source learning with block-wise missing data for Alzheimer’s disease prediction,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 185–193.

[39] Y. Li, T. Yang, J. Zhou, and J. Ye, “Multi-task learning based survival analysis for predicting Alzheimer’s disease progression with multi-source block-wise missing data,” in Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, 2018, pp. 288–296.

[40] H. Hotelling, “Relations between two sets of variates,” in Breakthroughs in statistics. Springer, 1992, pp. 162–190.

[41] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis,” in International conference on machine learning. PMLR, 2013, pp. 1247–1255.

[42] F. Ma, S.-L. Huang, and L. Zhang, “An efficient approach for audio-visual emotion recognition with missing labels and missing modalities,” in 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2021, pp. 1–6.

[43] Y. Lin, Y. Gou, Z. Liu, B. Li, J. Lv, and X. Peng, “COMPLETER: Incomplete multi-view clustering via contrastive prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11174–11183.

[44] W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning,” in International Conference on Machine Learning. PMLR, 2015, pp. 1083–1092.

[45] A. Zadeh, Y.-C. Lim, P. P. Liang, and L.-P. Morency, “Variational auto-decoder: A method for neural generative modeling from incomplete data,” arXiv preprint arXiv:1903.00840, 2019.

[46] A. Zadeh, S. Benoit, and L.-P. Morency, “Relay variational inference: A method for accelerated encoderless VI,” arXiv preprint arXiv:2110.13422, 2021.

[47] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency, “Context-dependent sentiment analysis in user-generated videos,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1, 2017, pp. 873–883.

[48] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “DialogueRNN: An attentive RNN for emotion detection in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 6818–6825.

[49] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 7370–7377.

[50] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proceedings of the 6th International Conference on Learning Representations, ICLR, 2018, pp. 1–12.

[51] Z. Lian, B. Liu, and J. Tao, “CTNet: Conversational transformer network for emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 985–1000, 2021.

[52] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference. Springer, 2018, pp. 593–607.

[53] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.

[54] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, and R. Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2122–2132.

[55] S. Mai, H. Hu, and S. Xing, “Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 164–172.

[56] Z. Lian, Y. Li, J. Tao, and J. Huang, “Speech emotion recognition via contrastive loss under siamese networks,” in Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data, 2018, pp. 21–26.

[57] Z. Zhao, Q. Li, Z. Zhang, N. Cummins, H. Wang, J. Tao, and B. W. Schuller, “Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition,” Neural Networks, vol. 141, pp. 52–60, 2021.

[58] F. Soldner, V. Pérez-Rosas, and R. Mihalcea, “Box of lies: Multimodal deception detection in dialogues,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1768–1777.

[59] D. Hazarika, R. Zimmermann, and S. Poria, “MISA: Modality-invariant and -specific representations for multimodal sentiment analysis,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1122–1131.

[60] Z. Sun, P. Sarma, W. Sethares, and Y. Liang, “Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8992–8999.

[61] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[62] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” in Proceedings of the Interspeech, 2019, pp. 3465–3469.

[63] Z. Fan, M. Li, S. Zhou, and B. Xu, “Exploring wav2vec 2.0 on speaker verification and language identification,” arXiv preprint arXiv:2012.06185, 2020.

[64] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with disentangled attention,” in Proceedings of the 8th International Conference on Learning Representations, 2020, pp. 1–21.

[65] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.

[66] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.

[67] Z. Zhao, Q. Liu, and S. Wang, “Learning deep global multi-scale and local attention features for facial expression recognition in the wild,” IEEE Transactions on Image Processing, vol. 30, pp. 6544–6556, 2021.

[68] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE signal processing letters, vol. 23, no. 10, pp. 1499–1503, 2016.

[69] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Proceedings of the Advances in Neural Information Processing Systems, 2007, pp. 153–160.

[70] L. Z. Wong, H. Chen, S. Lin, and D. C. Chen, “Imputing missing values in sensor networks using sparse data representations,” in Proceedings of the 17th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 2014, pp. 227–230.

[71] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A graph convolutional neural network for emotion recognition in conversation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019, pp. 154–164.

[72] J. Hu, Y. Liu, J. Zhao, and Q. Jin, “MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5666–5675.

[73] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

Zheng Lian received the B.S. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 2016, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2021. He is currently an Assistant Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include affective computing, deep learning, and multimodal emotion recognition.

Lan Chen received the B.S. degree from the China University of Petroleum, Beijing, China, in 2016, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2022. Her current research interests include computer graphics and image processing.


Licai Sun received the B.S. degree from Beijing Forestry University, Beijing, China, in 2016, and the M.S. degree from the University of Chinese Academy of Sciences, Beijing, China, in 2019. He is currently working toward the Ph.D. degree with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China. His current research interests include affective computing, deep learning, and multimodal representation learning.

Bin Liu received the B.S. and M.S. degrees from the Beijing Institute of Technology, Beijing, China, in 2007 and 2009, respectively, and the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an Associate Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include affective computing and audio signal processing.

Jianhua Tao received the M.S. degree from Nanjing University, Nanjing, China, in 1996, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2001. He is currently a Professor with the Department of Automation, Tsinghua University, Beijing, China. He has authored or coauthored more than eighty papers in major journals and conference proceedings. His current research interests include speech recognition, speech synthesis and coding methods, human–computer interaction, multimedia information processing, and pattern recognition. He has served as the Chair or a Program Committee Member for several major conferences, including ICPR, ACII, ICMI, ISCSLP, and NCMMSC. He is also a Steering Committee Member for the IEEE Transactions on Affective Computing, an Associate Editor for the Journal on Multimodal User Interfaces and the International Journal of Synthetic Emotions, and the Deputy Editor-in-Chief for the Chinese Journal of Phonetics. He has received several awards from major conferences, such as Eurospeech and NCMMSC.


